Introduction¶

In this lab, I explore how to visualize and analyze synthetic but realistic customer behavior data using interactive plots.

I'll begin by generating a synthetic dataset of 1,000 customers, each characterized by three behavioral features: Engagement Score, Purchase Frequency, and Product Interest. These features represent customer interaction levels, buying habits, and interest in products, respectively.

Customers are grouped into three segments: High Value, Mid Value, and Low Value. These segments reflect varying levels of loyalty, frequency of purchase, and potential business value.

I'll visualize the dataset using a 3D scatter plot to observe how customers are distributed in behavioral space. Then, I'll use box plots and categorical bar charts to explore differences between segments by region and gender.

Goal¶

The goal of this lab is to practice simulating, visualizing, and interpreting customer segments in a business context. These techniques are commonly used in marketing analytics, CRM optimization, and targeted campaign strategies.

In [4]:
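# Pinned versions used in this lab.
# Note: in the recorded run, numpy 2.2.0 conflicted with contourpy 1.2.0, gensim 4.3.3,
# and numba 0.60.0, and pip later reverted the environment to numpy 1.26.4.
!pip install numpy==2.2.0
!pip install pandas==2.2.3
!pip install matplotlib==3.9.3
!pip install plotly==5.24.1
!pip install umap-learn==0.5.7
# seaborn and scikit-learn are imported below; install them if they are missing.
!pip install seaborn scikit-learn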
Collecting numpy==2.2.0
Successfully installed numpy-2.2.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
contourpy 1.2.0 requires numpy<2.0,>=1.20, but you have numpy 2.2.0 which is incompatible.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.2.0 which is incompatible.
numba 0.60.0 requires numpy<2.1,>=1.22, but you have numpy 2.2.0 which is incompatible.
Collecting numpy>=1.23 (from matplotlib==3.9.3)
Successfully installed numpy-1.26.4

Import the required libraries¶

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

import umap.umap_ as UMAP  # module alias; the estimator class is UMAP.UMAP

Generate synthetic data with three clusters in a 3D space¶

In [8]:
# Set seed for reproducibility
np.random.seed(42)

# Define cluster centers for 3 segments
centers = [
    [85, 40, 9],   # High Value
    [60, 25, 6],   # Mid Value
    [35, 10, 3]    # Low Value
]
# Per-segment standard deviation; make_blobs applies each value to all three features
cluster_std = [5, 6, 8]

# Generate synthetic data (make_blobs is already imported above)
X, labels = make_blobs(n_samples=1000,
                       centers=centers,
                       cluster_std=cluster_std,
                       n_features=3,
                       random_state=42)

# Clip negative values to 0; no upper bound is enforced
X = np.clip(X, a_min=0, a_max=None)

# Create base DataFrame
df = pd.DataFrame(X, columns=["Engagement_Score", "Purchase_Frequency", "Product_Interest"])
df["Segment"] = pd.Series(labels).map({0: "High Value", 1: "Mid Value", 2: "Low Value"})

# Add synthetic Region and Gender
regions = ["North", "South", "East", "West"]
genders = ["Male", "Female"]

df["Region"] = np.random.choice(regions, size=len(df), p=[0.25, 0.25, 0.25, 0.25])
df["Gender"] = np.random.choice(genders, size=len(df), p=[0.5, 0.5])

df.head()
Out[8]:
   Engagement_Score  Purchase_Frequency  Product_Interest     Segment  Region  Gender
0         19.701960            9.450928          0.000000   Low Value   South    Male
1         72.375150           18.594803          6.145317   Mid Value    West  Female
2         83.556707           41.613593          4.863845  High Value    East  Female
3         61.346711           31.282590         16.103566   Mid Value    East  Female
4         82.552803           45.220804         12.409457  High Value   North  Female
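
As a quick sanity check on the generated data, a per-segment summary can confirm that the segment means sit near the chosen centers and show how much of the data was affected by clipping. The following is a minimal sketch, not part of the original lab: it regenerates the data with the same make_blobs parameters, and the names df_check, summary, sizes, and clipped_share are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs

# Recreate the dataset with the same parameters as the generation cell
centers = [[85, 40, 9], [60, 25, 6], [35, 10, 3]]
X, labels = make_blobs(n_samples=1000, centers=centers,
                       cluster_std=[5, 6, 8], n_features=3, random_state=42)
X = np.clip(X, a_min=0, a_max=None)

features = ["Engagement_Score", "Purchase_Frequency", "Product_Interest"]
df_check = pd.DataFrame(X, columns=features)  # hypothetical name for this check
df_check["Segment"] = pd.Series(labels).map(
    {0: "High Value", 1: "Mid Value", 2: "Low Value"})

# Number of customers per segment
sizes = df_check["Segment"].value_counts()

# Mean and standard deviation of each feature within each segment
summary = df_check.groupby("Segment")[features].agg(["mean", "std"]).round(2)

# Share of values in each feature that were clipped to the lower bound of 0
clipped_share = (df_check[features] == 0).mean().round(3)

print(sizes)
print(summary)
print(clipped_share)
print("Max Product_Interest:", round(df_check["Product_Interest"].max(), 2))
```

Because the same standard deviation is applied to every feature, Product Interest (with centers of 9, 6, and 3) is spread far more widely relative to its scale than Engagement Score, which is why some of its values are clipped to 0 and others exceed 10.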

Display the data in an interactive Plotly 3D scatter plot¶

In [10]:
import plotly.express as px

fig = px.scatter_3d(df,
                    x='Engagement_Score',
                    y='Purchase_Frequency',
                    z='Product_Interest',
                    color='Segment',
                    opacity=0.7,
                    color_discrete_sequence=px.colors.qualitative.G10,
                    title="3D Scatter Plot of Customer Segments")

fig.update_layout(
    scene=dict(
        xaxis_title='Engagement Score',
        yaxis_title='Purchase Frequency',
        zaxis_title='Product Interest'
    ),
    width=900,
    height=700
)

fig.update_traces(marker=dict(size=4, line=dict(width=0.5, color='black')), showlegend=True)
fig.show()

Visualizing Customer Segments by Key Attributes¶

To further explore the differences between customer groups, I plot the segments using a box plot and two bar charts.

These visualizations help uncover how customer segments vary in:

  • Engagement Score
  • Region
  • Gender

1. Box Plot: Engagement Score by Segment¶

This box plot helps us understand how engagement levels differ across customer segments. The spread, median, and potential outliers provide insight into behavioral variability.
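
The quantities a box plot draws can also be computed directly, which makes the comparison between segments concrete. The sketch below is an illustration under stated assumptions: box_stats is a hypothetical helper, the demo frame is a small stand-in generated with NumPy rather than the lab's df, and outliers are counted with the 1.5 × IQR whisker rule that matplotlib and seaborn use by default.

```python
import numpy as np
import pandas as pd


def box_stats(values: pd.Series) -> pd.Series:
    """Median, quartiles, IQR, and number of points beyond the 1.5*IQR whiskers."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((values < lower) | (values > upper)).sum())
    return pd.Series({"median": median, "q1": q1, "q3": q3,
                      "iqr": iqr, "n_outliers": n_outliers})


# Stand-in data (assumption): three segments with different engagement levels
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Segment": np.repeat(["High Value", "Mid Value", "Low Value"], 200),
    "Engagement_Score": np.concatenate([
        rng.normal(85, 5, 200),
        rng.normal(60, 6, 200),
        rng.normal(35, 8, 200),
    ]),
})

# One row of box-plot statistics per segment
grouped = demo.groupby("Segment")["Engagement_Score"]
stats = pd.DataFrame({seg: box_stats(vals) for seg, vals in grouped}).T

print(stats.round(2))
```

Applied to the lab's own data, the same helper could be called on df.groupby("Segment")["Engagement_Score"] in place of the stand-in frame.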

In [13]:
plt.figure(figsize=(10, 6))
# Assign x to hue (with legend=False) so palette is applied without the seaborn deprecation warning
sns.boxplot(data=df, x='Segment', y='Engagement_Score', hue='Segment', palette='Set2', legend=False)
plt.title('Engagement Score by Customer Segment')
plt.ylabel('Engagement Score')
plt.xlabel('Customer Segment')
plt.grid(True, linestyle='--', alpha=0.4)
plt.show()

2. Bar Chart: Customer Segment Distribution by Region¶

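This bar chart shows how each customer segment is distributed across regions. On real data, it can support regional campaign targeting or location-specific improvements to customer experience. In this synthetic dataset, Region is assigned uniformly at random, independently of Segment, so any differences between regions reflect sampling noise.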
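
A cross-tabulation gives the numbers behind a grouped bar chart of this kind, and normalizing by row makes regions of different sizes comparable. This is a minimal sketch on stand-in data: the demo frame and its random assignments are assumptions for illustration, not the lab's df.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): segments and regions drawn independently at random
rng = np.random.default_rng(42)
n = 1000
demo = pd.DataFrame({
    "Segment": rng.choice(["High Value", "Mid Value", "Low Value"], size=n),
    "Region": rng.choice(["North", "South", "East", "West"], size=n),
})

# Raw counts: the bar heights of a countplot with x='Region', hue='Segment'
counts = pd.crosstab(demo["Region"], demo["Segment"])

# Row-normalized shares: the segment mix within each region
shares = pd.crosstab(demo["Region"], demo["Segment"], normalize="index").round(3)

print(counts)
print(shares)
```

With the lab's data, the same two calls on df["Region"] and df["Segment"] would show whether the segment mix differs by region by more than a few percentage points.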

In [15]:
plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='Region', hue='Segment', palette='Set3')
plt.title('Customer Segments by Region')
plt.ylabel('Number of Customers')
plt.xlabel('Region')
plt.xticks(rotation=45)
plt.grid(True, linestyle='--', alpha=0.3)
plt.legend(title='Segment')
plt.tight_layout()
plt.show()

3. Bar Chart: Customer Segment Distribution by Gender¶

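This chart helps reveal any gender skew across the three customer segments, which can inform personalization strategies. In this synthetic dataset, Gender is assigned at random with equal probability, independently of Segment, so no systematic skew is expected.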
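
One way to judge whether an apparent skew is more than sampling noise is a chi-square test of independence on the Gender by Segment table. The sketch below is an optional addition rather than part of the original lab: it assumes SciPy is available (it is a dependency of umap-learn) and uses a stand-in demo frame with randomly assigned labels.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Stand-in data (assumption): gender and segment drawn independently at random
rng = np.random.default_rng(7)
n = 1000
demo = pd.DataFrame({
    "Segment": rng.choice(["High Value", "Mid Value", "Low Value"], size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
})

# Contingency table of observed counts (2 genders x 3 segments)
table = pd.crosstab(demo["Gender"], demo["Segment"])

# Chi-square test of independence between Gender and Segment
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}")
```

A large p-value is consistent with Gender and Segment being independent; a small one would suggest a skew worth investigating on real data.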

In [17]:
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='Gender', hue='Segment', palette='pastel')
plt.title('Customer Segments by Gender')
plt.xlabel('Gender')
plt.ylabel('Number of Customers')
plt.grid(True, linestyle='--', alpha=0.3)
plt.legend(title='Segment')
plt.tight_layout()
plt.show()